{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "# Week 3: Exploring Overfitting in NLP\n",
    "\n",
    "Welcome to this assignment! During this week you saw different ways to handle sequence-like data. You saw how some Keras' layers such as `GRU`, `Conv` and `LSTM` can be used to tackle problems in this space. Now you will put this knowledge into practice by creating a model architecture that does not overfit.\n",
    "\n",
    "For this assignment you will be using a variation of the [Sentiment140 dataset](https://www.tensorflow.org/datasets/catalog/sentiment140), which contains 1.6 million tweets alongside their respective sentiment (0 for negative and 4 for positive). **This variation contains only 160 thousand tweets.**\n",
    "\n",
    "You will also need to create the helper functions very similar to the ones you coded in previous assignments pre-process data and to tokenize sentences. However the objective of the assignment is to find a model architecture that will not overfit.\n",
    "\n",
    "Let's get started!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### TIPS FOR SUCCESSFUL GRADING OF YOUR ASSIGNMENT:\n",
    "\n",
    "- All cells are frozen except for the ones where you need to submit your solutions or when explicitly mentioned you can interact with it.\n",
    "\n",
    "\n",
    "- You can add new cells to experiment but these will be omitted by the grader, so don't rely on newly created cells to host your solution code, use the provided places for this.\n",
    "- You can add the comment # grade-up-to-here in any graded cell to signal the grader that it must only evaluate up to that point. This is helpful if you want to check if you are on the right track even if you are not done with the whole assignment. Be sure to remember to delete the comment afterwards!\n",
    "- Avoid using global variables unless you absolutely have to. The grader tests your code in an isolated environment without running all cells from the top. As a result, global variables may be unavailable when scoring your submission. Global variables that are meant to be used will be defined in UPPERCASE.\n",
    "- To submit your notebook, save it and then click on the blue submit button at the beginning of the page.\n",
    "\n",
    "Let's get started!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "hmA6EzkQJ5jt",
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "import pickle\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "import unittests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining some useful global variables\n",
    "\n",
    "Next you will define some global variables that will be used throughout the assignment. Feel free to reference them in the upcoming exercises:\n",
    "\n",
    "- `EMBEDDING_DIM`: Dimension of the dense embedding, will be used in the embedding layer of the model. Defaults to 100.\n",
    "\n",
    "\n",
    "- `MAX_LENGTH`: Maximum length of all sequences. Defaults to 32.\n",
    "\n",
    "    \n",
    "- `TRAINING_SPLIT`: Proportion of data used for training. Defaults to 0.9\n",
    "\n",
    "- `NUM_BATCHES`: Number of batches. Defaults to 128\n",
    "\n",
    "    \n",
    "**A note about grading:**\n",
    "\n",
    "**When you submit this assignment for grading these same values for these globals will be used so make sure that all your code works well with these values. After submitting and passing this assignment, you are encouraged to come back here and play with these parameters to see the impact they have in the classification process. Since this next cell is frozen, you will need to copy the contents into a new cell and run it to overwrite the values for these globals.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "EMBEDDING_DIM = 100\n",
    "MAX_LENGTH = 32\n",
    "TRAINING_SPLIT = 0.9\n",
    "NUM_BATCHES = 128"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore the dataset\n",
    "\n",
    "The dataset is provided in a csv file. \n",
    "\n",
    "Each row of this file contains the following values separated by commas:\n",
    "\n",
    "- target: the polarity of the tweet (0 = negative, 4 = positive)\n",
    "\n",
    "- ids: The id of the tweet\n",
    "\n",
    "- date: the date of the tweet\n",
    "\n",
    "- flag: The query. If there is no query, then this value is NO_QUERY.\n",
    "\n",
    "- user: the user that tweeted\n",
    "\n",
    "- text: the text of the tweet\n",
    "\n",
    "\n",
    "Take a look at the first five rows of this dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "data_path = \"./data/training_cleaned.csv\"\n",
    "df = pd.read_csv(data_path, header=None)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Looking at the contents of the csv file by using pandas is a great way of checking how the data looks like. Now you need to create a `tf.data.Dataset` with the corresponding text and sentiment for each tweet:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Standardize labels so they have 0 for negative and 1 for positive\n",
    "labels = df[0].apply(lambda x: 0 if x == 0 else 1).to_numpy()\n",
    "\n",
    "# Since the original dataset does not provide headers you need to index the columns by their index\n",
    "sentences = df[5].to_numpy()\n",
    "\n",
    "# Create the dataset\n",
    "dataset = tf.data.Dataset.from_tensor_slices((sentences, labels))\n",
    "\n",
    "# Get the first 5 elements of the dataset\n",
    "examples = list(dataset.take(5))\n",
    "\n",
    "print(f\"dataset contains {len(dataset)} examples\\n\")\n",
    "\n",
    "print(f\"Text of second example look like this: {examples[1][0].numpy().decode('utf-8')}\\n\")\n",
    "print(f\"Labels of first 5 examples look like this: {[x[1].numpy() for x in examples]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1: train_val_datasets\n",
    "\n",
    "Now you will code the `train_val_datasets` function, which given the full tensorflow dataset, shuffles it and splits the dataset into two, one for training and the other one for validation taking into account the `TRAINING_SPLIT` defined earlier. It should also batch the dataset so that it is arranged into `NUM_BATCHES` batches.\n",
    "\n",
    "In the previous week you created this split between training and validation by manipulating numpy arrays but this time the data already comes encoded as `tf.data.Dataset`s. This is so you are comfortable manipulating this kind of data regardless of the format.\n",
    "\n",
    "**Hints:**\n",
    "\n",
    "\n",
    "- Take a look at the [take](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#take) and [skip](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#skip) methods to generate the training and validation data.\n",
    "\n",
    "\n",
    "- The [batch](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch) method is useful to split the dataset into the desired number of batches.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: train_val_datasets\n",
    "\n",
    "def train_val_datasets(dataset):\n",
    "    \"\"\"\n",
    "    Splits the dataset into training and validation sets, after shuffling it.\n",
    "    \n",
    "    Args:\n",
    "        dataset (tf.data.Dataset): Tensorflow dataset with elements as (sentence, label)\n",
    "    \n",
    "    Returns:\n",
    "        (tf.data.Dataset, tf.data.Dataset): tuple containing the train and validation datasets\n",
    "    \"\"\"   \n",
    "    ### START CODE HERE ###\n",
    "    \n",
    "    # Compute the number of sentences that will be used for training (should be an integer)\n",
    "    train_size = None\n",
    "\n",
    "    # Split the sentences and labels into train/validation splits\n",
    "    train_dataset = None\n",
    "    validation_dataset = None\n",
    "\n",
    "    # Turn the dataset into a batched dataset with num_batches batches\n",
    "    train_dataset = None\n",
    "    validation_dataset = None\n",
    "\n",
    "    ### END CODE HERE ###\n",
    "    \n",
    "    return train_dataset, validation_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Create the train and validation datasets\n",
    "train_dataset, validation_dataset = train_val_datasets(dataset)\n",
    "\n",
    "print(f\"There are {len(train_dataset)} batches for a total of {NUM_BATCHES*len(train_dataset)} elements for training.\\n\")\n",
    "print(f\"There are {len(validation_dataset)} batches for a total of {NUM_BATCHES*len(validation_dataset)} elements for validation.\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***Expected Output:***\n",
    "\n",
    "```\n",
    "There are 1125 batches for a total of 144000 elements for training.\n",
    "\n",
    "There are 125 batches for a total of 16000 elements for validation.\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_train_val_datasets(train_val_datasets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2: fit_vectorizer\n",
    "\n",
    "Now that you have batched datasets for training and validation it is time for you to begin the tokenization process.\n",
    "\n",
    "Begin by completing the `fit_vectorizer` function below. This function should return a [TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer that has been fitted to the training sentences.\n",
    "\n",
    "\n",
    "**Hints:**\n",
    "\n",
    "\n",
    "- This time you didn't define a custom `standardize_func` but you should convert to lower-case and strip punctuation out of the texts. For this check the different options for the [`standardize`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization#args) argument of the [TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer.\n",
    "\n",
    "\n",
    "- The texts should be truncated so that the maximum length is equal to the `MAX_LENGTH` defined earlier. Once again check the [`docs`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization#args) for an argument that can help you with this.\n",
    "\n",
    "- You should NOT predefine a vocabulary size but let the layer learn it from the sentences."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: fit_vectorizer\n",
    "\n",
    "def fit_vectorizer(dataset):\n",
    "    \"\"\"\n",
    "    Adapts the TextVectorization layer on the training sentences\n",
    "    \n",
    "    Args:\n",
    "        dataset (tf.data.Dataset): Tensorflow dataset with training sentences.\n",
    "    \n",
    "    Returns:\n",
    "        tf.keras.layers.TextVectorization: an instance of the TextVectorization class adapted to the training sentences.\n",
    "    \"\"\"    \n",
    "\n",
    "    ### START CODE HERE ###\n",
    "    \n",
    "    # Instantiate the TextVectorization class, defining the necessary arguments alongside their corresponding values\n",
    "    vectorizer = tf.keras.layers.TextVectorization( \n",
    "        None,\n",
    "    ) \n",
    "    \n",
    "    # Fit the tokenizer to the training sentences\n",
    "    \n",
    "    \n",
    "    ### END CODE HERE ###\n",
    "    \n",
    "    return vectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Get only the texts out of the dataset\n",
    "text_only_dataset = train_dataset.map(lambda text, label: text)\n",
    "\n",
    "# Adapt the vectorizer to the training sentences\n",
    "vectorizer = fit_vectorizer(text_only_dataset)\n",
    "\n",
    "# Check size of vocabulary\n",
    "vocab_size = vectorizer.vocabulary_size()\n",
    "\n",
    "print(f\"Vocabulary contains {vocab_size} words\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "***Expected Output:***\n",
    "\n",
    "```\n",
    "Vocabulary contains 145856 words\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_fit_vectorizer(fit_vectorizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**This time you don't need to encode the labels since these are already encoded as 0 for negative and 1 for positive**. But you still need to apply the vectorization to the texts of the dataset using the adapted vectorizer you've just built. You can do so by running the following cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Apply vectorization to train and val datasets\n",
    "train_dataset_vectorized = train_dataset.map(lambda x,y: (vectorizer(x), y))\n",
    "validation_dataset_vectorized = validation_dataset.map(lambda x,y: (vectorizer(x), y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Using pre-defined Embeddings\n",
    "\n",
    "This time you will not be learning embeddings from your data but you will be using pre-trained word vectors. In particular you will be using the 100 dimension version of [GloVe](https://nlp.stanford.edu/projects/glove/) from Stanford."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Define path to file containing the embeddings\n",
    "glove_file = './data/glove.6B.100d.txt'\n",
    "\n",
    "# Initialize an empty embeddings index dictionary\n",
    "glove_embeddings = {}\n",
    "\n",
    "# Read file and fill glove_embeddings with its contents\n",
    "with open(glove_file) as f:\n",
    "    for line in f:\n",
    "        values = line.split()\n",
    "        word = values[0]\n",
    "        coefs = np.asarray(values[1:], dtype='float32')\n",
    "        glove_embeddings[word] = coefs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you have access to GloVe's pre-trained word vectors. Isn't that cool?\n",
    "\n",
    "Let's take a look at the vector for the word **dog**:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": true
   },
   "outputs": [],
   "source": [
    "test_word = 'dog'\n",
    "\n",
    "test_vector = glove_embeddings[test_word]\n",
    "\n",
    "print(f\"Vector representation of word {test_word} looks like this:\\n\\n{test_vector}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Feel free to change the `test_word` to see the vector representation of any word you can think of.\n",
    "\n",
    "Also, notice that the dimension of each vector is 100. You can easily double check this by running the following cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "print(f\"Each word vector has shape: {test_vector.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can represent the words in your vocabulary using the embeddings. To do this, save the vector representation of each word in the vocabulary in a numpy array.\n",
    "\n",
    "A couple of things to notice:\n",
    "- You need to build a `word_index` dictionary where it stores the encoding for each word in the adapted vectorizer.\n",
    "\n",
    "- If a word in your vocabulary is not present in `GLOVE_EMBEDDINGS` the representation for that word is left as a column of zeros."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Create a word index dictionary\n",
    "word_index = {x:i for i,x in enumerate(vectorizer.get_vocabulary())}\n",
    "\n",
    "print(f\"The word dog is encoded as: {word_index['dog']}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "C1zdgJkusRh0",
    "outputId": "538df576-bbfc-4590-c3a3-0559dab5f176"
   },
   "outputs": [],
   "source": [
    "# Initialize an empty numpy array with the appropriate size\n",
    "embeddings_matrix = np.zeros((vocab_size, EMBEDDING_DIM))\n",
    "\n",
    "# Iterate all of the words in the vocabulary and if the vector representation for \n",
    "# each word exists within GloVe's representations, save it in the embeddings_matrix array\n",
    "for word, i in word_index.items():\n",
    "    embedding_vector = glove_embeddings.get(word)\n",
    "    if embedding_vector is not None:\n",
    "        embeddings_matrix[i] = embedding_vector"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a sanity check, make sure that the vector representation for the word `dog` matches the column of its index in the `EMBEDDINGS_MATRIX`:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "test_word = 'dog'\n",
    "\n",
    "test_word_id = word_index[test_word]\n",
    "\n",
    "test_vector_dog = glove_embeddings[test_word]\n",
    "\n",
    "test_embedding_dog = embeddings_matrix[test_word_id]\n",
    "\n",
    "both_equal = np.allclose(test_vector_dog,test_embedding_dog)\n",
    "\n",
    "print(f\"word: {test_word}, index: {test_word_id}\\n\\nEmbedding is equal to column {test_word_id} in the embeddings_matrix: {both_equal}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you have the pre-trained embeddings ready to use!"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 3: create_model\n",
    "\n",
    "Now you need to define a model that will handle the problem at hand while not overfitting.\n",
    "\n",
    "**Hints**:\n",
    "\n",
    "- The layer immediately after `tf.keras.Input` should be a `tf.keras.layers.Embedding`. The parameter that configures the usage of the pre-trained embeddings is already provided but you still need to fill the other ones.\n",
    "\n",
    "- There multiple ways of solving this problem. So try an architecture that you think will not overfit.\n",
    "\n",
    "\n",
    "- You can try different combinations of layers covered in previous ungraded labs such as:\n",
    "    - `Conv1D`\n",
    "    - `Dropout`\n",
    "    - `GlobalMaxPooling1D`    \n",
    "    - `MaxPooling1D`    \n",
    "    - `LSTM`    \n",
    "    - `Bidirectional(LSTM)`\n",
    "\n",
    "\n",
    "- Include at least one `Dropout` layer to mitigate overfitting.\n",
    "\n",
    "\n",
    "- The last two layers should be `Dense` layers.\n",
    "\n",
    "\n",
    "- Try simpler architectures first to avoid long training times. Architectures that are able to solve this problem usually have around 3-4 layers (excluding the last two `Dense` ones). \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: create_model\n",
    "\n",
    "def create_model(vocab_size, pretrained_embeddings):\n",
    "    \"\"\"\n",
    "    Creates a binary sentiment classifier model\n",
    "    \n",
    "    Args:\n",
    "        vocab_size (int): Number of words in the vocabulary.\n",
    "        pretrained_embeddings (np.ndarray): Array containing pre-trained embeddings.\n",
    "\n",
    "    Returns:\n",
    "        (tf.keras Model): the sentiment classifier model\n",
    "    \"\"\"\n",
    "    ### START CODE HERE ###\n",
    "    \n",
    "    model = tf.keras.Sequential([ \n",
    "        tf.keras.Input(None),\n",
    "        tf.keras.layers.Embedding(input_dim=None, output_dim=None, weights=[pretrained_embeddings], trainable=None),\n",
    "    ])\n",
    "    \n",
    "    model.compile( \n",
    "        loss=None,\n",
    "        optimizer=None,\n",
    "        metrics=['accuracy'] \n",
    "    ) \n",
    "\n",
    "    ### END CODE HERE ###\n",
    "\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell allows you to check the number of total and trainable parameters of your model and prompts a warning in case these exceeds those of a reference solution, this serves the following 3 purposes listed in order of priority:\n",
    "\n",
    "- Helps you prevent crashing the kernel during training.\n",
    "\n",
    "- Helps you avoid longer-than-necessary training times.\n",
    "- Provides a reasonable estimate of the size of your model. In general you will usually prefer smaller models given that they accomplish their goal successfully.\n",
    "\n",
    "\n",
    "**Notice that this is just informative** and may be very well below the actual limit for size of the model necessary to crash the kernel. So even if you exceed this reference you are probably fine. However, **if the kernel crashes during training or it is taking a very long time and your model is larger than the reference, come back here and try to get the number of parameters closer to the reference.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Create your untrained model\n",
    "model = create_model(vocab_size, embeddings_matrix)\n",
    "\n",
    "# Check parameter count against a reference solution\n",
    "unittests.parameter_count(model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Take an example batch of data\n",
    "example_batch = train_dataset_vectorized.take(1)\n",
    "\n",
    "try:\n",
    "\tmodel.evaluate(example_batch, verbose=False)\n",
    "except:\n",
    "\tprint(\"Your model is not compatible with the dataset you defined earlier. Check that the loss function and last layer are compatible with one another.\")\n",
    "else:\n",
    "\tpredictions = model.predict(example_batch, verbose=False)\n",
    "\tprint(f\"predictions have shape: {predictions.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Expected Output:**\n",
    "```\n",
    "predictions have shape: (NUM_BATCHES, n_units)\n",
    "```\n",
    "\n",
    "Where `NUM_BATCHES` is the globally defined variable and `n_units` is the number of units of the last layer of your model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_create_model(create_model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Train the model and save the training history\n",
    "history = model.fit(\n",
    "\ttrain_dataset_vectorized, \n",
    "\tepochs=20, \n",
    "\tvalidation_data=validation_dataset_vectorized\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**To pass this assignment your `val_loss` (validation loss) should either be flat or decreasing.** \n",
    "\n",
    "Although a flat `val_loss` and a lowering `train_loss` (or just `loss`) also indicate some overfitting what you really want to avoid is having a lowering `train_loss` and an increasing `val_loss`.\n",
    "\n",
    "With this in mind, the following three curves will be acceptable solutions:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table><tr><td><img src='images/valid-1.png'></td><td><img src='images/valid-2.jpg'></td><td><img src='images/valid-3.jpg'></td></tr></table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While the following would not be able to pass the grading:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<table><tr><td><img src='images/invalid-1.jpg'></td></tr></table>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run the next block of code to plot the metrics. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Get training and validation accuracies\n",
    "acc = history.history['accuracy']\n",
    "val_acc = history.history['val_accuracy']\n",
    "loss = history.history['loss']\n",
    "val_loss = history.history['val_loss']\n",
    "\n",
    "# Get number of epochs\n",
    "epochs = range(len(acc))\n",
    "\n",
    "fig, ax = plt.subplots(1, 2, figsize=(10, 5))\n",
    "fig.suptitle('Training and validation performance')\n",
    "\n",
    "for i, (data, label) in enumerate(zip([(acc, val_acc), (loss, val_loss)], [\"Accuracy\", \"Loss\"])):\n",
    "    ax[i].plot(epochs, data[0], 'r', label=\"Training \" + label)\n",
    "    ax[i].plot(epochs, data[1], 'b', label=\"Validation \" + label)\n",
    "    ax[i].legend()\n",
    "    ax[i].set_xlabel('epochs')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "A more rigorous way of setting the passing threshold of this assignment is to use the slope of your `val_loss` curve.\n",
    "\n",
    "**To pass this assignment the slope of your `val_loss` curve should be 0.0005 at maximum.** You can test this by running the next cell:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_history(history)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**If your model generated a validation loss curve that meets the criteria above, run the following cell and then submit your assignment for grading. Otherwise, try with a different architecture.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "with open('history.pkl', 'wb') as f:\n",
    "    pickle.dump(history.history, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Congratulations on finishing this week's assignment!**\n",
    "\n",
    "You have successfully implemented a neural network capable of classifying sentiment in text data while doing a fairly good job of not overfitting! Nice job!\n",
    "\n",
    "**Keep it up!**"
   ]
  }
 ],
 "metadata": {
  "grader_version": "1",
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
